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Improvements Relating to Parallel Data Processing 

Introduction 

The present invention concerns improvements relating to parallel data processing. The 
invention is intended for use in digital computers. More specifically, digital computers 
that have a Single Instruction Multiple Data (SIMD) data processor. The purpose of 
the invention is to accelerate the inter-processor communications for a specific sub-class 
of operation known as alternation. 

Efficient inter-processor communications in SIMD data parallel processors is 
paramount to performance. The requirement is to provide efficient hardware support 
for binary f divide-and-conquer f algorithms, whereby the multitude of processors in a 
data parallel architecture can be sub-divided into even/odd subsets in an efficient 
manner. 
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Figure 1 Divide-and-conquer principal of operation 

Figure 1 illustrates this principal, showing the summation of 8 numbers by successively 
adding even-odd pairs in log 2 N steps, where N=8. 

The major attributes of such a hardware network, known as the alternation network 
are: 

• Provision of hardware augmentation for divide-and-conquer algorithms. 

• Transparent operation across chip boundaries in modular multi-chip solutions with a 
minimum of external hardware support. 

• Operation to complete in a minimum number of clock cycles - ideally one -reasonably 
independent of the number of devices participating in the operation. 

Although the invention is general in its application early instances of it have been 
manufactured for use with the Aspex ASP (Associative String Processor) data parallel 
processor. The ASP is a SIMD data processor that in typical configurations operates on 
1024 to 65536 data items in parallel. Some of the invention's detail is specific to this 
data processor. The major attributes of the ASP are: 

• 1152 processing elements on a single device 12.0mm x 12.0mm in size. 

• 100MHz clock speed. 

Future versions of this processor will incorporate up to 8192 processing elements. 
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One of the features of the ASP processor is the string (one-dimensional) topology of the 
processor array, wherein the processing elements are connected end-to-end. This 
facilitates some of the features of the network described in this invention. 

Prior Art 

Simple solution based on EXOR gates 

The simplest solution for an alternate operation is to provide a linear string of EXOR 
(Exclusive-OR) gates, one per processor, connected so that the control signal is local to 
the processor, and the communication signal propagates to the next processor. 

Figure 2 shows such a solution. The data processor's instruction generates a status 
signal from each processor called a match, where the match signal from the nth 
processor is termed M . The application of the match signals causes the generation of a 
signal which propagates between processors, swapping state whenever it encounters a 
condition where a local match (M ) is true. By combining the propagation term (LAO,,) 

D 

in true or complement form with the original match signal, then two new match states 
can be generated, termed M n andM n . 



Mo_cvcn Mo.ocM Macro, M^odd 




Mo M„ 

Figure 2 Basic hardware alternation network 



The versatility of this solution is that it allows the even and odd match sets to be 
easily generated, offering application flexibility. Its major disadvantage is the poor 
speed of the solution, since the propagating signal is limited by the gate delay of a single 
EXOR gate and the number of processors in the string. 



An Example 

An example of an alternation network in the VASP-256 data parallel processor 
comprises a network of 256 EXOR gates, one per processor. Execution of an alternate 
instruction requires approximately 11 clock cycles (at 50MHz) to completely generate 
an even-odd partition of M fl to 



2 



Invention 
Introduction 

The invention is a new alternation network architecture that achieves the functionality 
of generating alternating match sets at a vastly improved speed over the current art It 
does this by generating an initial alternate locally within a segment, then generating a 
correction term based on the forward propagation of the output of the local segment 
EXOR chains. This output provides information about whether the total number of 
matches within a segment are even or odd. This correction term is forward propagated 
through the network, utilising a hierarchy of bypass levels, which further accelerate the 
propagation of information. In a subsequent cycle, the correction term is utilised to 
generate a new input to the local segment, which then serves to cause the segment to 
maintain or reverse initial alternate to achieve the correct final state. 

The novelty of the invention is in: 

• The generation of a pair of match vectors, representing alternating sets of even and 
odd matching processors • 

• The use of a circuit for multiplying the rate at which the final alternating sets are 
generated. 

• The capability to partition the alternating sets into local subsets. 

• The capability to control the inject condition at the head of each subset, thereby 
controlling the generation of the alternating sets such that the leftmost matching APE 
may belong to the even or odd sets. 

• The capability to exploit the topology of the network to implement fault-tolerance 
with no further logic. 

Network Terminology and Definitions 

Segment A collection of APEs, typically 16. 
Block A collection of segments, typically 8. 

Chip A collection of blocks. 

The Alternating Communications Network Detail 
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Figure 3 shows the basic modular Alternation Network cell. 
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Figure 3 Alternation network low-level module 



The module labelled ALTERNATE NETWORK may be considered to correspond to 
the network shown in Figure 2. Typically this portion of the network comprises only a 
few stages (processing elements), since propagation through this network must be 
carried out twice, 

1. as a pre-condition to the propagation of correction terms, and 

2. as a post-condition to apply the correction terms for the final generation of the even- 
odd sets. 



Consequently this sub-network must be kept as short as possible, consistent with the 
requirements of the floorplan. 

The module labelled *P 5 (partition switch) is a simple switch function controlled by a 
signal P $ where s denotes the segment identifier. For example s=0 is considered to be 
the leftmost segment If this signal is TRUE then the carry forward of the correction 
term via the partition switch is disabled. 

If this signal is TRUE then communication into this segment is also disabled, a task 
handled by the module labelled INJECT NETWORK, which controls the insertion of a 
propagating signal into the leftmost end of the ALTERNATE NETWORK, as well as 
responding to the partition control signal (PJ to regulate input to the network. 



Inject Network Detail 

Figure 4 is a block diagram of the inject network. 



4 



# • 



GND 
!P,&& !C 

input 
!P S &&C 



3> 



isolation 
flip-flop V 



E && !C 



OR 




EN 



CLOCK 

Figure 4 Inject network detail 



The signal T represents the global inject state. The inject state may be used to 
determine whether the leftmost participating processor in a subnet belongs to the even 
or odd sets. This inject state is only applied if the propagate term for the given segment 
(P s ) is TRUE, resulting in this subnet being partitioned from segments to its left. The 
global inject state then "overrides all forward propagating correction terms into the 
network input and unconditionally becomes the input to the ALTERNATE 
NETWORK. 

The signal C is true during the correction phase of the network operation, when 
communication network will be generating a carry forward correction term into the 
network input which will be used to modify the initial state of the ALTERNATE 
NETWORK (provided that P 8 is FALSE). The initial state of the ALTERNATE 
NETWORK is used as the basis of generating the correction terms and propagating 
them forward through the network, hence the correction term is isolated from the 
ALTERNATE NETWORK by the isolation flip-flop. 

Once the correction term has been generated and is stable, it will be clocked through to 
the ALTERNATE NETWORK during the evaluate phase of the network operation, 
when the correction term will be enabled through the isolation flip-flop. 

When the signal C is false, then the isolation flip-flop will be pre(set) or pre(reset) 
depending upon the state of the inject input (I) and the partition flag (PJ. 

Block-level Hierarchy 

The module shown in Figure 3 can now be readily cascaded to form a higher-order 
network. For example, Figure 5 shows a number of identical networks of the type 
shown in Figure 3 cascaded to form a block network. 
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see Figure 3 
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Figure 5 Block-level hierarchical network topology 

The block partition switch is controlled by a signal (P b ) which is the logical 
combination of individual segment partition (P 8 ) control signals. In this way, only if all 
segment partition switches within the span of a block are closed is the forward 
propagation of a correction term enabled. 



Such a hierarchical arrangement can be extended indefinitely, for example by 
cascading block level networks (Figure 5), plus the addition of the corresponding 
external -network (Figure 6). In this example it is assumed that a chip comprises an 
array of blocks, as per the definitions above. 



see Figure 5 
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Inter-chip Network Topology 

The hierarchical network topology can be readily extended within the constraints of : 

• the target speed of the correction phase and 

• the floorplan topology of the integrated circuit. 

Therefore the hierarchy within a given chip may be extended to comprise more than the 
two levels described above. 
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Ultimately however, the communications signals must propagate off-chip, supporting 
the goal of transparent chip-chip communication, as well as supporting the 
continuation of the network logical hierarchy with a minimum of external logic. 

/ chip 
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Figure 7 Multi-chip network topology 



Figure 7 illustrates how the network hierarchy may be readily extended across multiple 
chips. In this example a new signal (labelled BYP in the diagram) has been introduced, 
which is assumed to be TRUE if the chip may be bypassed (Le. no open switch exists 
anywhere within the device). 
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The primary input (LAI2) is 
0 (even) and the proceeding 
segment output is 0 (even) 
therefore a correction signal 
(active HIGH) is not 
generated. 
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Figure 8 Principal of operation 



Principal of Operation 

The principal of operation of the network is illustrated in Figure 8, which shows a 
representative block comprising three segments. In this example, no partition flags are 
set, therefore the network is continuous and unbroken along its length. The initial state 
of the individual segments is evaluated in parallel (assuming a default inject input set 
FALSE), resulting in indications of even and odd match counts. These are then 
combined by the carry forward network to produce the correction terms. The role of 
the JEXOR networks becomes clear, since the carry forward of pairs of odd correction 
terms (FALSE, FALSE) or pairs of even correction terms (TRUE, TRUE) always 
results in the generation and propagation of an even (TRUE) carry forward term. 
Similarly, the application of, say, an even and an odd correction term (TSUET^PSIJSE) 
results in the propagation of an odd correction term (FALSE).. 

The duration of generation and propagation of the correction terms is determined by 
the internal modular hierarchy within the network. 

The final stable correction terms are applied on a subsequent clock edge, when the 
isolation flip-flop in the inject network is clocked. Following the application of the 
correction term, a sufficient interval must be allowed for the correction term to 
propagate through the individual segment networks to achieve the final even-odd split. 
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Whenever an operation is extended (such as the generation and propagation of the 
correction terms) this is implemented in the context of a synchronous processor device 
by issuing repeated instructions according to a deterministic (compiled) algorithm. 

Instruction Repetition 

A typical usage of the network requires a sequence of one or more of the correction data 
processor instructions, followed by a evaluation processor instruction and a single 
propagation instruction. Control logic within the chip instruction unit allows the length 
of this sequence to be deterministic. 

An Example 

Consider the following example. The instruction sequence that needs to be fed to the 
data processor for a particular application is: 

{{A}RepeatSlot(4),B} 

where {..} RepeatSlot(N) means repeat the instruction inside the braces N times. 



Example ASP alternation Code 

Consider the following example. Count participating APEs and write the count into the 
data register in the rightmost APE in the string. 

The pseudo-code for this is: 

identify participating APEs; 
A = 0; 

FOR X = LSB TO MSB DO 

alternate participating APEs into even and odd sets; 

IF rightmost APEs is in odd set THEN 

A [rightmost] [X]=l; (* Xth bit of rightmost APE is set *) 

mask odds and reassign evens as new participating APEs 
END • *' " . 

The actual ASP code to perform this task is given below: 



Initialise 

{ { InitWrit eSerial ( every, slOpen, dX, dX, 0 , 0 , aw { ab2 . dO , abl . dO } , NoRead) } , 
{ExecWrite (bmD, NoCsoOpt) } } ; 

Start of loop, initialise for counter and set intra and inter chip inject 
inputs to TRUE 

{ Set ( agbln j ect ) / 
For (sfsz, 1) , 
{Segment Inject (1) } , 
{NoOpO}}; 

Finding participating processors and generate alternate even-odd subsets 
{ { InitSearchOrAddSerial (dX, dX, 0 , 0, aw { ab2 . d0 f abO . dl } ) } , 
{ExecSearchOrAddSerial (bmD, trAlt , Not Force Zero) , 
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RepeatSlot (AlternateSlots) } } ; 

Set intra-chip inject inputs FALSE 
f (Seqment Inject (0) } / 
{NoOpO}}; 

Set inter-chip inject input FALSE 
{Clear (agblnject) , 

{ InitWriteSerial (at, slOpen, dX 7 dX, 0, 0, aw{ abl . dl } , NoRead) } , 
{ ExecWrite (bmD, NpCsoOpt ) } } ; 

Establish that rightmost participating APE is odd 

{ { InitWriteTernary ( anl , slOpen, t f 0 , XT, ScalarBus , aw{ } ) } , 
{NoOpO}}; 

{ {InitWriteSerial (all, slOpen, dX, dX, 0, 0, aw{ab2 .dl} , NoRead) , 
RepeatSlot (ActivateSlots) } , 

{ExecWrite (bmD, NoCsoOpt) } } ; 
{ { InitSearchOrAddSerial (dX, dX, 0 , 0 , aw { abO . dl , abl . dO , ab2 • dO } ) } , 

{.ExecSearchOrAddSerial (bmD, trl, NotForceZero) } } ; 
{ {InitWriteTernary (anr, slOpen, tf0,XT, ScalarBus, aw{ } ) , 

{NoOpO }}; 

Write result LSB 

{Init(sfal,sfbal), 

{ InitWriteSerial (aerf , slOpen, dl, dX, sf al, 0, aw{ ab2 . dl } , NoRead) , 
RepeatSlot (ActivateSlots ) } , 
O, 

Inc(sfal) }; 
{ O, 

{ ExecWrite (bmD, NoCsoOpt ) } } ; 

Mask current odd set from participating in next cycle of alternates 
{ {InitWriteSerial(every,slOpen,dX,dX,0,0,aw{abl.d0}, NoRead) }, 
{ExecWrite (bmD, NoCsoOpt ) } , 
EndFor ( ) } ; 

The evaluation of this task using the invention would result in completion of this 
application in 240 clock cycles instead of 304 clock cycles using the network shown in 
Figure 2, which represents the prior art. The effect is far more significant in the next 
planned ASP device, which will incorporate 4096 APEs. Here completion of the 
algorithm will require some 400 clock cycles, compared to some 2500 clock cycles 
without the invention. 



Extension of the concept to support yield-enhancement 

The network topology allows the subnet (at segment, block, chip etc. levels) to be readily 
by passed i n the event of a de fect wh ic h results in faulty operation. For example, Figure 
9 shows the addition of an isolation gate in the ALTERNATE NETWORK output, 
forcing the output to a null state in the event of a fault in the network. By forcing the 
partition switch (P) unconditionally closed we may exploiting existing hardware to 
allow any forward propagating correction terms to unconditionally bypass the segment 
(or block, chip etc). 
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Figure 9 Network amendments for fault-tolerant bypassing 



An Implementation 

An implementation of the invention, in the VASP-4096/TX chip has completed design. 
It completes a full alternation across 4096 data parallel processor in a simulated 
interval of 24ns - or 5 clock cycles at 160MHz. It is implemented in a single 0.25|j.m 
silicon process. 

Benefits 

The main benefits of the invention are: 



• Speed. The ability of complete an alternating communication operation across 4096 
APEs in <30ns. 

• Size. The network at the first level of the propagation hierarchy is essentially 
identical to existing examples. The network required to generate the correction 
terms can be easily accommodated within the existing floorplan without significant 
area impact. 

• Cost and Reliability. The network is amenable to implementing fault-tolerant 
bypassing of failed segments and blocks with only minor amendments, thereby 
adhering to compatibility with the rest of the communication network which exploits 
similar fault-tolerance. This results in higher yields land significant cost savings in 
volume manufactured parts. 

These benefits open number of new application areas. The application of this 
network to VASP processors allows the architecture to implement divide-and-conquer 
algorithms on parallel data sets much more efficiently. The creation of even-odd sets in 
an associative processor must be able to effectively exploit the non-deterministic nature 
of the processor, which allows participating elements of the initial active set to be 
distributed across a large number of physical processors with indeterminate 
physical/logical locations. 

The overall device cost is not only reduced by the small size of the network but also: 

• it conforms "to the general communication link topology, 
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• it is amenable to fault-tolerance through selective bypassing, thereby opening the 
possibility of significant savings in chip manufacturing cost, and 

• it accommodates controllable inject states with a minimum of hardware. 
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